11 research outputs found

    Network metrics for assessing the quality of entity resolution between multiple datasets

    Matching entities between datasets is a crucial step for combining multiple datasets on the semantic web. A rich literature exists on different approaches to this entity resolution problem. However, much less work has been done on how to assess the quality of such entity links once they have been generated. Evaluation methods for link quality are typically limited to either comparison with a ground truth dataset (which is often not available), manual work (which is cumbersome and prone to error), or crowdsourcing (which is not always feasible, especially if expert knowledge is required). Furthermore, the problem of link evaluation is greatly exacerbated for links between more than two datasets, because the number of possible links grows rapidly with the number of datasets. In this paper, we propose a method to estimate the quality of entity links between multiple datasets. We exploit the fact that the links between entities from multiple datasets form a network, and we show how simple metrics on this network can reliably predict their quality. We verify our results in a large experimental study using six datasets from the domain of science, technology and innovation studies, for which we created a gold standard. This gold standard, available online, is an additional contribution of this paper. In addition, we evaluate our metric on a recently published gold standard to confirm our findings.
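
    The idea of treating entity links as a network can be illustrated with a small sketch. The abstract does not name the metrics the authors use, so the choice of cluster density here is purely an assumption for illustration, as are the function names and the example identifiers. The sketch groups linked entities into connected components and scores each component by the fraction of possible pairwise links that are actually present.

```python
from collections import defaultdict


def clusters(links):
    """Group linked entities into connected components (candidate identity clusters)."""
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, result = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        result.append(component)
    return result


def density(component, links):
    """Fraction of possible undirected links inside the component that exist.

    Assumed illustrative metric, not necessarily the one used in the paper.
    """
    n = len(component)
    if n < 2:
        return 0.0
    edges = {
        frozenset((a, b))
        for a, b in links
        if a in component and b in component and a != b
    }
    return len(edges) / (n * (n - 1) / 2)


# Hypothetical links between entities from datasets a, b, c and d.
links = [
    ("a:1", "b:1"), ("b:1", "c:1"), ("a:1", "c:1"),  # fully connected triangle
    ("a:2", "b:2"), ("b:2", "c:2"), ("c:2", "d:2"),  # sparse chain
]
scores = [density(component, links) for component in clusters(links)]
```

    Under this assumed metric, the fully connected triangle scores higher than the chain, matching the intuition that mutually confirming links are more likely to be correct.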

    Contextual entity disambiguation in domains with weak identity criteria: Disambiguating Golden Age Amsterdamers

    Entity disambiguation is a widely investigated topic, and many matching algorithms have been proposed. However, this task has not yet been satisfactorily addressed when the domain of interest provides poor or incomplete data with little discriminating power. In these cases, the use of content fields such as name and date is not enough, and the simple use of relations with other entities is not of much help when these related entities also need disambiguation before they can be used. Therefore, we propose an approach for the disambiguation of clustered resources using context (related entities that are also clustered) as evidence for reconciling matched entities. We test the proposed method on datasets of historical records from Amsterdam in the 17th century for which context is available, and we compare the results of the proposed approach to a gold standard generated by three experts, which we make available online. The results show that the proposed approach manages to meaningfully use context for isolating identity sub-clusters with higher quality by eliminating potentially false positive matches.

    Managing metadata for science, technology and innovation studies: The RISIS case

    Here, we describe the RISIS-SMS metadata system, developed to support the use of heterogeneous datasets in the field of Science, Technology and Innovation Studies (STIS). These data are partly within the RISIS infrastructure, but often elsewhere. The system has three aims: (i) to help researchers to search for and understand data that will help to answer specific research questions, without having to access or download the data. As datasets often have restricted access, browsing metadata is a key feature of the system: researchers need help identifying the relevant data from different sources for their research, and deciding for which data it is worthwhile asking for access; (ii) to support the enrichment of data by linking the metadata system to the Linked Open Data environment (LOD); (iii) to facilitate application-driven data integration.

    Is my:sameAs the same as your:sameAs? Lenticular lenses for context-specific identity

    Linking between entities in different datasets is a crucial element of the Semantic Web architecture, since those links allow us to integrate datasets without having to agree on a uniform vocabulary. However, it is widely acknowledged that the owl:sameAs construct is too blunt a tool for this purpose. It entails full equality between two resources independent of context. But whether or not two resources should be considered equal depends not only on their intrinsic properties, but also on the purpose or task for which the resources are used. We present a system for constructing context-specific equality links. In a first step, our system generates a set of probable links between two given datasets. These potential links are decorated with rich metadata describing how, why, when and by whom they were generated. In a second step, a user then selects the links which are suited for the current task and context, constructing a context-specific “Lenticular Lens”. Such lenses can be combined using operators such as union, intersection, difference and composition. We illustrate and validate our approach wit
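
    The lens operators named in the abstract (union, intersection, difference, composition) can be sketched as operations on sets of links. The abstract does not define their exact semantics, so this sketch assumes plain set semantics over directed (source, target) pairs and ignores the link metadata; the variable names, identifiers and the `compose` helper are hypothetical.

```python
def compose(first, second):
    """Chain two link sets: (a, b) in first and (b, c) in second yield (a, c)."""
    targets_by_source = {}
    for b, c in second:
        targets_by_source.setdefault(b, set()).add(c)
    return {(a, c) for a, b in first for c in targets_by_source.get(b, ())}


# Hypothetical link sets between datasets x and y, produced by two matchers.
name_links = {("x:1", "y:1"), ("x:2", "y:2")}
date_links = {("x:1", "y:1"), ("x:3", "y:3")}
# Hypothetical links from dataset y to a third dataset z.
y_to_z = {("y:1", "z:9")}

either = name_links | date_links      # union: accepted by at least one matcher
both = name_links & date_links        # intersection: accepted by both matchers
name_only = name_links - date_links   # difference: name match without date match
x_to_z = compose(both, y_to_z)        # composition: follow links through y
```

    A task that needs high precision might use the intersection as its lens, while an exploratory task might prefer the union; that choice is what makes the resulting equality context-specific.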

    Amsterdamers from the Golden Age to the Information Age via Lenticular Lenses: Short paper

    The Golden Agents infrastructure project closely collaborates with the Amsterdam City Archives (SAA) to publish their digitized registries as Linked Open Data (LOD). In their All Amsterdam Acts project, the SAA digitizes/indexes all its notarial acts. For Golden Agents, studying the interactions between the production and consumption of the creative industries of the Dutch Golden Age is relevant because the probate inventories, testaments, etc. in these acts reveal the objects that families living in Amsterdam had in their houses. However, to link these data to other relevant collections we need to disambiguate names to identify individuals. This is a challenging task because (i) citizens in the Dutch Golden Age were not given any identification number; (ii) the information supplied in a single index does not suffice to uniquely identify an individual (weak identity criteria); and (iii) a single individual can occur multiple times within an index. Here we discuss the requirements we identified for addressing this challenge (Sect. 2); the Lenticular Lenses as an innovative context-sensitive entity linking method (Idrissou et al., 2017) (Sect. 3) and our first experiments applying this tool for connecting three SAA indexes (marriage, baptism and probate inventories) and two authoritative datasets (Ecartico and ULAN) (Sect. 4

    Is my:sameAs the same as your:sameAs?

    Linking between entities in different datasets is a crucial element of the Semantic Web architecture, since those links allow us to integrate datasets without having to agree on a uniform vocabulary. However, it is widely acknowledged that the owl:sameAs construct is too blunt a tool for this purpose. It entails full equality between two resources independent of context. But whether or not two resources should be considered equal depends not only on their intrinsic properties, but also on the purpose or task for which the resources are used. We present a system for constructing context-specific equality links. In a first step, our system generates a set of probable links between two given datasets. These potential links are decorated with rich metadata describing how, why, when and by whom they were generated. In a second step, a user then selects the links which are suited for the current task and context, constructing a context-specific “Lenticular Lens”. Such lenses can be combined using operators such as union, intersection, difference and composition. We illustrate and validate our approach wit

    Documenting the Creation, Manipulation and Evaluation of Links for Reuse and Reproducibility

    Data integration is an essential task in the open world of the Semantic Web. Many approaches have been proposed that achieve such integration by linking related entities across data providers, but they lack support for in-depth documentation of the processes involved, such as the creation, manipulation and evaluation of links. As a consequence, detailed documentation that eases the understanding and reproducibility of the underlying processes is needed for a reliable reuse of the graphs of identity available in the Semantic Web. We present here an approach to document such links and their processes, building upon a representation we call VoID+. It enables link publishers to provide data users with information that better supports them in accessing and using links. We show that our approach with the proposed VoID+ ontology allows us to address the relevant competency questions around the reuse of integrated Semantic Web data. We also demonstrate how our approach has been successfully implemented in the Lenticular Lens, a user interface tool that annotates the links it discovers, manipulates or validates under the user’s guidance. Based on a real-life humanities case study, we show that the ontology amply annotates links throughout their life cycle for reliable decision making by data users.